#Introduction
In order to get a sastisfying conclusion, getting and generating accurate datapoints are an extremely important process. More than simple just getting umbers, it is also important to understand other people’s view on the events. By using National laungage processing, REGDOCS were anaylzed and convertedinto a 2d vector and is then visualized. This new number quantifies how to writer describes the event.
#Methodology
In order to compare,constast and analyze each document, a maching learning model was created to convert each document into a 2d vector using doctocev. In doctovec, each vector had 500 dimensions. Since it impossible to visualize data with this many dimensions, tsne was incoportaed to reduce the amount of dimensions to 2. This allowed us to properly graphed and visualize in a 2d space.
Using the doctovec model, we can convert any document into a series of vectors. These vector can then be compared to other documents.
NLP converts the REGDOCS into a 2d vector by reviewing each document. Then the documents are compared to each other.
dataset <- read_csv("Dataset/finalData.csv")
dataset$submitter<- revalue(dataset$submitter,c("National Energy Board" = "NEB"))
x <- ggplot(dataset, aes(x = x,
y= y,
color = submitter,
key = name,
key2 = submitter,
key3 = date,
key4 = link)) +
geom_point(size = 0.5) +
theme(legend.position = 'none')+
labs(x = 'X value',
y = 'Y value',
title = 'REGDOCS information and value' )
ggplotly(x,tooltip = c("name","submitter","date","link"), dynamicTicks = TRUE, layerData = 1,
originalData = TRUE, source = "A")
#Further exploration We currently have a working model that can take a document and convert it to a vector for analysis, we can now apply this to the search engine and sort elements by their similarity. We can also give businesses who send in documents, similar documents as precedent and to give them case studies to help them in their future filings.
This model does not have to be limited to letters, it can apply to any categories and since we used selenium for search results, we aren’t dependent on which categories we pick.
The model is also saved to our directory, we can load the model at any time and train more datasets to it.
Another great extension would be to build a tsne of every document as opposed to just the letters we went through, tsne models are dependent on the data given so we would have to recall the funtion. This could give government officials a better overall view of how the unstructured data looks.s